London is a fascinating city with many faces and, as background for this project, we will assume that a group of stakeholders has already decided to invest in its real estate market.
Given that, the objective of this project is to find newly built residential properties that, on one hand, match their investment profile and, on the other, have the potential to increase in value. We will accomplish this by defining a rating system based on each property's characteristics, on relevant trends in the area where it is located and on its surrounding venues.
First, we will look for available datasets from which we can gather insights about future housing market trends in various areas of London. Afterwards, we will pick the most promising areas and look for residential units on sale within their perimeter by browsing specialized housing websites. We will then inspect each property's surroundings using the Foursquare API to see what kinds of venues are nearby and further characterize each property.
Combining all this information about local housing market trends, property features and surroundings, this document proposes and applies a methodology for approaching the residential housing market so that an investor is finally able to make an informed buying decision.
Even though this document is intended for a group of stakeholders interested in the London housing market as a real estate investment, it may also interest the small investor who wants to settle in London by buying a new home with certain characteristics and surroundings and with a good chance of increasing in value in the future.
We will acquire data from different sources on three main subjects:
As local areas, we will use the geographical subdivision of London into boroughs, and the main source of data will be the London Datastore https://data.london.gov.uk/. It was created by the Greater London Authority (GLA) as a first step towards freeing London's data. It will provide us with a huge amount of historical data and projections, grouped by borough or ward, on a wide range of topics.
Looking at the London data available at borough level, we will pick the topics that will likely have an impact on the housing market trend, namely:
As said, all the information has been acquired from the London Datastore, apart from the City of London crime data, which comes from the UKCrimeStats website (www.ukcrimestats.com), an open data platform of the Economic Policy Centre (www.economicpolicycentre.com). In the methodology section of this document we will further describe every dataset analyzed; most will be in time series form. The common goal will be to produce a rating system that synthesizes the impact of each local feature on the future housing market value in the same area.
Afterwards, we will acquire data about actual residential properties by analyzing one of the most important UK websites for property sales: www.rightmove.co.uk.
This source will provide us with the following information about residential properties on sale in the London area:
These data will give us the ability to estimate further property characteristics, like "spaciousness" and "affordability", to better match investor preferences.
Finally, with the location coordinates acquired in the previous step, we will use the Foursquare API to acquire nearby venues. We will subdivide venues in these categories:
and count the number of venues nearby in each category to characterize the surroundings of each residential unit.
At this point, we will have all the information to completely rate each single property. We will then define a final rating system that we could match with an investor profile to obtain a personalized recommended buying list.
The methodology is divided into two phases.
In the first phase we will assess each borough's quality in relation to a potential residential investment.
We will analyze the various data collected to evaluate predictions for the imminent future. Based on these predictions and the impact of each feature on the housing market value, we will define for each borough a rating on each subject analyzed.
Predictions about the imminent future are fundamental for our goal. We have to consider that the current value of each feature is already correlated with the current housing market value, so by itself it is not sufficient to give us the information we want, namely the likelihood of a property value increase. Indeed, the current value determines both the current selling and buying prices, so it cannot tell us what prices will be in the future. In other words, if we invest today we want to find the borough where there is a high probability that the value of our investment will increase.
The mentioned rating will reflect the impact of the selected study topics on the housing market, and for each one of them every borough will be graded from zero (worst) to one (best). A good rating will mean that the feature will likely contribute to a positive increment of the housing market value in that borough.
The rating itself will be evaluated from the relative increment between the future value and the current value of the given feature, with the formula:

relative increment = (future value − current value) / current value
We will assess the future values at one year from December 2020.
With all the increments calculated for a given feature, we will obtain the final ratings by simply applying a min-max normalization to the zero-one range, so that for every feature the best borough is rated one and the worst zero.
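As an illustration, the scaling step can be sketched as follows (the borough names and increment values are made up for the example):

```python
import pandas as pd

# Hypothetical relative increments of a feature, one value per borough
increments = pd.Series({"Camden": 0.05, "Hackney": 0.12,
                        "Barnet": -0.02, "Brent": 0.03})

# Min-max normalization to the [0, 1] range:
# the best-performing borough gets 1, the worst gets 0
ratings = (increments - increments.min()) / (increments.max() - increments.min())
print(ratings)
```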
Almost all the datasets we will analyze in this phase are time series, and some of them also contain predictive data. When a future value is lacking, we will extrapolate one using a polynomial regression model and, since we want to capture just the overall trend, we will keep the polynomial degree low.
All the ratings will be stored in a dataframe, and we will perform a clustering analysis using the K-means and DBSCAN algorithms to obtain clusters with interesting characteristics, with the purpose of selecting some promising boroughs for the second phase.
In the second phase we will use a web scraping algorithm to acquire data about new residential properties on sale in the most promising boroughs and run several iterations of the Foursquare "explore" API to find the surrounding venues of each property.
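As a rough sketch of this step (the credentials are placeholders, and the mock response below only illustrates the parsing logic, not real data):

```python
def explore_nearby(lat, lng, radius=500, limit=50):
    """Query the Foursquare v2 'explore' endpoint around a location.
    CLIENT_ID / CLIENT_SECRET are placeholder credentials."""
    import requests
    params = dict(client_id="CLIENT_ID", client_secret="CLIENT_SECRET",
                  v="20201201", ll=f"{lat},{lng}", radius=radius, limit=limit)
    return requests.get("https://api.foursquare.com/v2/venues/explore",
                        params=params).json()

def count_categories(response):
    """Tally how many nearby venues fall in each category."""
    counts = {}
    for item in response["response"]["groups"][0]["items"]:
        name = item["venue"]["categories"][0]["name"]
        counts[name] = counts.get(name, 0) + 1
    return counts

# Minimal mock response, just to show the parsing logic
sample = {"response": {"groups": [{"items": [
    {"venue": {"categories": [{"name": "Café"}]}},
    {"venue": {"categories": [{"name": "Park"}]}},
    {"venue": {"categories": [{"name": "Café"}]}},
]}]}}
print(count_categories(sample))  # {'Café': 2, 'Park': 1}
```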
Based on the data gathered, we will run several clustering algorithms to extract the most interesting groups of properties suitable for our goal.
We start with the construction of a dataframe containing basic information about the 33 London boroughs:
We will use the authority codes as a key to unify borough naming across the datasets, and the other fields for the evaluation of the various grades.
From this dataframe we will also derive an empty dataframe that we will fill with borough ratings in the next steps.
The aforementioned boroughs table is shown below.
Analysing the historical housing market series is certainly important to understand the possible future trend.
The importance of this dataset lies in the fact that all the other data we will collect about boroughs have only a partial impact on future housing market trends. In other words, there are surely other influential components that we cannot assess simply because they are not recorded and therefore not measurable. However, reading the historical housing market trends will give us information about how those hidden components acted in the past. Consequently, when we predict the future based on these data, we will somehow take them into account.
For this task we will use data from 1995 to 2020 available on the London Datastore website. The main source of this dataset is the UK HM Land Registry. From it we will acquire, for each borough, the historical average house sale prices, and we will extrapolate future values using a polynomial regression model. Specifically, we will fit two regression models for each borough, one using short term data and one using long term data. That will give us two regression curves that represent the short and long term housing market trends. We will then evaluate both curves at a future time to define a predicted value, and finally we will calculate the relative increase from the current value to the future value.
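The short/long term extrapolation described above can be sketched like this (the price series is synthetic; in the project the real Land Registry series is used):

```python
import numpy as np

# Synthetic average-price series for one borough, indexed by years since 1995
t = np.arange(0, 26)                      # 1995 .. 2020
prices = 80_000 + 9_000 * t + 40 * t**2   # made-up but plausible shape

# Long term model: all observations; short term model: the last 8 years.
# A low degree (2) captures the overall trend without overfitting.
long_term = np.poly1d(np.polyfit(t, prices, deg=2))
short_term = np.poly1d(np.polyfit(t[-8:], prices[-8:], deg=2))

current, future = 25, 26  # 2020 and 2021
for name, model in [("long", long_term), ("short", short_term)]:
    rel_increase = (model(future) - model(current)) / model(current)
    print(f"{name} term relative increase: {rel_increase:.4f}")
```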
Finally, we will assign two ratings by scaling all the relative increases foreseen by the short term model so that the best performing borough is graded one and the worst performing is graded zero. Of course, we will do the same for the long term model.
This phase concludes with the recording of these ratings in the ratings dataframe.
In the following plots, we can see the observations over the years and the second degree regression curves that best fit the data. Visually, the more the right end of a curve points up, the better the rating. The plots are interactive and two different boroughs can be selected to compare performances; by default, the best and worst performers by rating are presented.
A population increase is a direct cause of an increase in demand for residential properties and, ultimately, of an increase in housing market value. That said, it is important for our scope to examine demographic data. We will do this using datasets from the London Datastore which include, for each London borough, Greater London Authority demographic estimates (2016-based projections), the 2011 Census and mid-year estimates by the UK Office for National Statistics. Luckily for us, this dataset also contains predictions for future years up to 2050, so we will not need predictive models in this phase. To grade each borough, we will simply evaluate the relative increase in population density from the current date to the future date, and we will perform the same scaling as in the previous phase. Finally, we will store the data in the ratings dataframe.
In the following plots, we can see the observations and predictions over the years; the slope of the line connecting the last two observations represents the relative increment. Visually, the more the line's right end points up, the better the rating. The plots are interactive and two different boroughs can be selected to compare performances; by default, the best and worst performers by rating are presented.
In this paragraph we analyse trends in the personal earnings of workers in different London boroughs. We assume that people tend to live as near as possible to their workplace, so we will analyse a dataset that reports median income by workplace rather than by residence.
The dataset is provided by the UK Office for National Statistics (ONS) and contains information about the earnings of employees who work in an area, who are on adult rates and whose pay for the survey pay-period was not affected by absence.
As done previously, we will evaluate past trends with a polynomial regression model and predict possible short term future increases, which we will then translate into borough ratings.
In the following plots, we can see the observations over the years and the second degree regression curve that best fits the data. Visually, the more the right end of the curve points up, the better the rating. The plots are interactive and two different boroughs can be selected to compare performances; by default, the best and worst performers by rating are presented.
An increase in housing availability in a specific area is a negative factor because it expands supply and consequently shrinks the housing market value in that area. Therefore, we need to know how many residential properties will be available in the future and, luckily for us, through the London Datastore we have access to the housing approvals recorded in the London Development Database (LDD).
The LDD contains details of all planning consents meeting criteria agreed with the London boroughs, which are responsible for submitting data to the database. Only planning consents are recorded in the database. For details of applications being considered by a local planning authority (borough), or for refusals, we should visit each relevant planning authority's website. For the sake of simplicity, we will assume that the refusal rate is a constant percentage among all boroughs and thus will not affect our computations.
To rate each borough in this category, we will assess how many permissions have been completed in the last 24 months. This figure, considering building phase timings, will give us a forecast of how many new residential units will enter the market in the near future. We also have to take into account that a given number of new residential properties impacts each borough differently depending on its demographics. Thus, we will divide the above value by the borough population to obtain a parameter we can use to compare boroughs with one another.
The bar plot below shows, for each borough, the number of new residential units per thousand inhabitants that are about to enter the housing market in the next years. The rating itself is noted above each bar and represented by its colour. Visually, the higher the bar, the worse the rating.
It is evident that crime levels and land value are inversely proportional. To acquire crime data we will use the historical series provided by the London Metropolitan Police Service through the London Datastore website. We will merge two time series, one covering the last 24 months and the other containing less recent observations starting from April 2010.
The aforementioned dataset will not contain information about the City of London borough, because the City of London Police, not the Metropolitan Police Service, is responsible for the safety of everyone in the 'Square Mile'.
Since we want to complete the dataset, we will acquire this information from the first table on the webpage www.ukcrimestats.com/Police_Force/City_of_London_Police and merge the data.
The final dataset contains crime observations from December 2010 onwards for every London borough.
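The table extraction can be done with pandas; a sketch follows (the column names here are invented, since the page's exact layout is not reproduced in this document):

```python
import io
import pandas as pd

# In the project, the live page would be read directly, e.g.:
#   tables = pd.read_html("https://www.ukcrimestats.com/Police_Force/City_of_London_Police")
# Here we parse an inline snippet with a comparable (assumed) structure.
html = """
<table>
  <tr><th>Month</th><th>Total Crimes</th></tr>
  <tr><td>Nov 2020</td><td>512</td></tr>
  <tr><td>Dec 2020</td><td>487</td></tr>
</table>
"""
city_crimes = pd.read_html(io.StringIO(html))[0]  # [0] picks the first table
print(city_crimes)
```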
In the following plots, we can see the observations over the years and the second degree regression curve that best fits the data. Visually, the more the right end of the curve points down, the better the rating. The plots are interactive and two different boroughs can be selected to compare performances; by default, the best and worst performers by rating are presented.
We will use datasets from a study commissioned from King's College London by Transport for London and the Greater London Authority.
You can read the full report here: https://www.london.gov.uk/what-we-do/environment/pollution-and-air-quality/modelling-long-term-health-impacts-air-pollution-london.
This study used a computer simulation to estimate the long-term health impacts from 2016 to 2050 of the Ultra Low Emission Zone (ULEZ) and the wider suite of policies included in the London Environment Strategy (LES). Specifically, this study estimates the health impacts of the change in concentration of two pollutants: Nitrogen Dioxide (NO2) and Particulate Matter (PM2.5). These pollutants are known to have long-term health effects.
For each borough we will take the sum of the incidence of both NO2- and PM2.5-related diseases for each year from 2016 to 2021, and the rating will be evaluated considering the relative increase in incidence from the 2020 value to the 2021 value. Naturally, the lower the increase, the better the rating.
In the following plots, we can see the observations and predictions over the years; the slope of the line connecting the last two observations represents the relative increment. Visually, the more the line's right end points down, the better the rating. The plots are interactive and two different boroughs can be selected to compare performances; by default, the best and worst performers by rating are presented.
Now that all the data about boroughs have been collected and analysed, and the boroughs' performances have been measured, let's have a look at our final ratings dataframe, displayed below.
In the next paragraph, we will try to use machine learning clustering algorithms to see if we can obtain some clusters populated with boroughs that are good candidates for further in-depth study.
We will use the K-means and DBSCAN algorithms for this task and, since they are both based on the concept of Euclidean distance, we will apply a system of weights to the features that shrinks the variation range of those we assume to be less important for our scenario. Indeed, shrinking a feature's range means that the samples' "diversity" (distance) along that particular feature is reduced, so it is unlikely to produce new clusters. Samples that differ only in reduced-range features will most likely end up spread among clusters created by samples that differ in full-range features.
That said, according to our personal feelings about how the discussed topics could affect the housing market in the upcoming years, here are the weights we will use:
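Applying the weights amounts to multiplying each ratings column by its weight before clustering; a minimal sketch with made-up ratings and weight values:

```python
import pandas as pd

# Made-up ratings in [0, 1] for two boroughs and four of the features
ratings = pd.DataFrame(
    {"housing_trend": [0.9, 0.2], "population": [0.4, 0.8],
     "crime": [0.6, 0.1], "air_quality": [0.3, 0.7]},
    index=["Borough A", "Borough B"])
weights = pd.Series({"housing_trend": 1.0, "population": 0.8,
                     "crime": 0.5, "air_quality": 0.3})

# Scaling a column shrinks its contribution to Euclidean distance,
# so less important features influence the clustering less
weighted = ratings.mul(weights, axis=1)
print(weighted)
```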
First, let's have a look at K-means clustering. This algorithm aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centroid), so as to minimize the within-cluster sum of squares (variance). Since the number of clusters has to be given as a starting parameter, a trial and error approach has to be taken to assess the optimal number of clusters to consider.
To do so, we will iterate the algorithm with various starting numbers of clusters and random centroid positions, and we will measure its performance using two different metrics: silhouette and inertia.
The silhouette score tells how far away the data points in one cluster are from the data points in another cluster. It ranges from negative one (worst) to positive one (best). Values near zero indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
Inertia tells how far away the points within a cluster are from its centroid. Inertia starts from zero (best) and goes up.
In the picture below we can see the trends of inertia and silhouette score for various iterations of the K-means algorithm at different numbers of clusters and random centroid starting positions. Only the random position that generates the best silhouette score is shown in the plot. Overall, 140 iterations have been evaluated, and we can identify two interesting points at three and four clusters. Both points have a silhouette score slightly above 0.3. We think the point at four clusters is a bit more interesting, even if its silhouette score is slightly smaller: at this point the inertia is lower, and its plot also shows a small elbow. The elbow point in the inertia graph is interesting because beyond it the change in inertia becomes less significant.
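The sweep over the number of clusters and random centroid positions can be sketched as follows (the data matrix is a random stand-in for the weighted ratings dataframe):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.random((33, 7))  # stand-in: 33 boroughs x 7 feature ratings

results = []
for k in range(2, 8):
    best = None
    for seed in range(10):  # several random centroid initializations
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        score = silhouette_score(X, km.labels_)
        if best is None or score > best[0]:
            best = (score, km.inertia_)
    results.append((k, best[0], best[1]))
    print(f"k={k}: best silhouette={best[0]:.3f}, inertia={best[1]:.2f}")
```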
In the following picture we further analyse the K-means algorithm output with number of clusters set at 4.
What we understand immediately, looking at the ratings distribution in each cluster, is that no group performs well in all categories.
Focusing on the more relevant features, according to the weights defined previously, let's point out some clues we can see:
Finally, let's have a look at the K-means clusters composition in the graph below.
Density-based spatial clustering of applications with noise (DBSCAN) is a non-parametric, density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed, i.e. whose distance is less than "epsilon" and whose number is greater than or equal to "min samples". It starts with an arbitrary point that has not been visited. This point's neighbourhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labelled as noise.
We will iterate through several values of epsilon and min samples until we find a satisfying output. To monitor the quality of each iteration, we will keep track of three parameters:
Increasing epsilon reduces the noise, because more and more isolated points are reached by ever-larger growing clusters, and increases the size of the largest cluster, since the clusters formed tend to merge with one another. If the data are effectively clusterable, we should find a trade-off value where the noise is low but the largest cluster is not yet too large.
The graph below shows the tendency of the aforementioned parameters at different values of epsilon and minimum sample size. We can observe how the data tend to form a single big cluster surrounded by noisy samples. We think an interesting point is at epsilon equal to 0.41 and min samples equal to 2, with 3 clusters formed and a noise of 6.
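The parameter sweep can be sketched like this (again, the data matrix is a random stand-in for the weighted ratings):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = rng.random((33, 7))  # stand-in: 33 boroughs x 7 feature ratings

results = []
for eps in [0.30, 0.41, 0.50, 0.60]:
    for min_samples in [2, 3]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        clusters = set(labels) - {-1}          # label -1 marks noise points
        n_noise = int(np.sum(labels == -1))
        largest = max((int(np.sum(labels == c)) for c in clusters), default=0)
        results.append((eps, min_samples, len(clusters), n_noise, largest))
        print(f"eps={eps}, min_samples={min_samples}: "
              f"{len(clusters)} clusters, noise={n_noise}, largest={largest}")
```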
In the following picture we further analyse the DBSCAN algorithm output with the aforementioned parameters.
Focusing on the more relevant features we can see:
Again, let's have a look at the DBSCAN clusters composition in the graph below.
So far we have understood that each borough has some pros and cons, and the same holds for the clusters formed by the two algorithms used. To pick some interesting boroughs to investigate further in the residential property evaluation stage, we need to summarize the various ratings for each of them so that we end up with an overall measure.
We will call this measure the "Residential Investment Score" (RIS), and it will be the weighted sum of all the previous ratings, using the weights system already defined. The score itself will be scaled with a min-max normalization to the zero-one range, where one is the best and zero the worst.
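The RIS computation is essentially one weighted sum followed by the usual scaling; a sketch with made-up numbers:

```python
import pandas as pd

# Made-up ratings and the same kind of weights system used for clustering
ratings = pd.DataFrame(
    {"housing_trend": [0.9, 0.2, 0.5], "population": [0.4, 0.8, 0.6],
     "crime": [0.6, 0.1, 0.9]},
    index=["Borough A", "Borough B", "Borough C"])
weights = pd.Series({"housing_trend": 1.0, "population": 0.8, "crime": 0.5})

# Residential Investment Score: weighted sum, then min-max scaled to [0, 1]
score = (ratings * weights).sum(axis=1)
ris = (score - score.min()) / (score.max() - score.min())
print(ris.sort_values(ascending=False))
```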
With this new parameter, it is now straightforward to pick the right boroughs. Indeed, in the bar plot below we can see the RIS calculated for every borough on the y axis. For visual reference, the higher and greener the bar, the better the score.
As we can see, we have an uncontested winner, by a margin of more than 20%.
Focusing on the best performers with an above-average RIS score (0.6), we can see that they are included in all K-means clusters apart from N°3, as predicted, and in DBSCAN clusters N°0 and N°1.
Before going deep into the residential property analysis phase, let's have an overview of what we have found about borough features and their relation to future housing market value.
In the bar plots below we can see the ratings achieved by each borough, the ratings' mean value and the Residential Investment Score (RIS). These plots are presented in an interactive form and the reader can choose which boroughs to display for a detailed one-to-one comparison. Initially, the best and worst RIS performers are shown by default.
In the choropleth map below we can further visualize how each borough scores in the various categories, which can be chosen interactively via the legend radio buttons. Hovering the mouse over a borough displays all its ratings.
Inspecting this map, we can discover how geography influences, in some way, the various performances.
We can see:
Finally, the RIS tells us that, with some exceptions, outer boroughs seem more promising than inner boroughs for a residential investment.